Prosper Loan Data Exploratory Data Analysis


Abstract

Prosper is a peer-to-peer lending platform that aims to connect people who need money with those who have money to invest. In this Exploratory Data Analysis, I explore a Prosper dataset containing loan information for over a 100,000 people between the years 2006 and 2013.

I first sifted through the dataset, which has 82 variables, and roughly thought about which variables I was interested in exploring and which were outside the domain of my exploration. I first planned on making a subset of the data, but instead opted for using the entire dataset in case I got any new ideas midway through my exploration.

I wanted my EDA to have a purpose, to show to those who are interested the facts about the loan data. In peer-to-peer lending, there are three main stakeholders: borrowers, lenders, and the organization itself. I have decided that I want to focus on Prosper as a stakeholder, and have designed this exploration with their interests in mind (I have not ignored the borrowers and lenders, however, and there’s a lot of revelations for them as well).

The exploration is divided into 3 analytical segments in increasing order of complexity: Univariate Plots, Bivariate Plots, and Multivariate Plots, as well as a Reflection segment at the end that summarizes my experience and thoughts throughout this EDA.

Introduction


Loading the Dataset


setwd('~/Downloads')

loans <- read.csv('prosperLoanData.csv')


Dataset Structure


Let’s take a look at the sheer size of this dataset:


str(loans)
## 'data.frame':    113937 obs. of  81 variables:
##  $ ListingKey                         : Factor w/ 113066 levels "00003546482094282EF90E5",..: 7180 7193 6647 6669 6686 6689 6699 6706 6687 6687 ...
##  $ ListingNumber                      : int  193129 1209647 81716 658116 909464 1074836 750899 768193 1023355 1023355 ...
##  $ ListingCreationDate                : Factor w/ 113064 levels "2005-11-09 20:44:28.847000000",..: 14184 111894 6429 64760 85967 100310 72556 74019 97834 97834 ...
##  $ CreditGrade                        : Factor w/ 9 levels "","A","AA","B",..: 5 1 8 1 1 1 1 1 1 1 ...
##  $ Term                               : int  36 36 36 36 36 60 36 36 36 36 ...
##  $ LoanStatus                         : Factor w/ 12 levels "Cancelled","Chargedoff",..: 3 4 3 4 4 4 4 4 4 4 ...
##  $ ClosedDate                         : Factor w/ 2803 levels "","2005-11-25 00:00:00",..: 1138 1 1263 1 1 1 1 1 1 1 ...
##  $ BorrowerAPR                        : num  0.165 0.12 0.283 0.125 0.246 ...
##  $ BorrowerRate                       : num  0.158 0.092 0.275 0.0974 0.2085 ...
##  $ LenderYield                        : num  0.138 0.082 0.24 0.0874 0.1985 ...
##  $ EstimatedEffectiveYield            : num  NA 0.0796 NA 0.0849 0.1832 ...
##  $ EstimatedLoss                      : num  NA 0.0249 NA 0.0249 0.0925 ...
##  $ EstimatedReturn                    : num  NA 0.0547 NA 0.06 0.0907 ...
##  $ ProsperRating..numeric.            : int  NA 6 NA 6 3 5 2 4 7 7 ...
##  $ ProsperRating..Alpha.              : Factor w/ 8 levels "","A","AA","B",..: 1 2 1 2 6 4 7 5 3 3 ...
##  $ ProsperScore                       : num  NA 7 NA 9 4 10 2 4 9 11 ...
##  $ ListingCategory..numeric.          : int  0 2 0 16 2 1 1 2 7 7 ...
##  $ BorrowerState                      : Factor w/ 52 levels "","AK","AL","AR",..: 7 7 12 12 25 34 18 6 16 16 ...
##  $ Occupation                         : Factor w/ 68 levels "","Accountant/CPA",..: 37 43 37 52 21 43 50 29 24 24 ...
##  $ EmploymentStatus                   : Factor w/ 9 levels "","Employed",..: 9 2 4 2 2 2 2 2 2 2 ...
##  $ EmploymentStatusDuration           : int  2 44 NA 113 44 82 172 103 269 269 ...
##  $ IsBorrowerHomeowner                : Factor w/ 2 levels "False","True": 2 1 1 2 2 2 1 1 2 2 ...
##  $ CurrentlyInGroup                   : Factor w/ 2 levels "False","True": 2 1 2 1 1 1 1 1 1 1 ...
##  $ GroupKey                           : Factor w/ 707 levels "","00343376901312423168731",..: 1 1 335 1 1 1 1 1 1 1 ...
##  $ DateCreditPulled                   : Factor w/ 112992 levels "2005-11-09 00:30:04.487000000",..: 14347 111883 6446 64724 85857 100382 72500 73937 97888 97888 ...
##  $ CreditScoreRangeLower              : int  640 680 480 800 680 740 680 700 820 820 ...
##  $ CreditScoreRangeUpper              : int  659 699 499 819 699 759 699 719 839 839 ...
##  $ FirstRecordedCreditLine            : Factor w/ 11586 levels "","1947-08-24 00:00:00",..: 8639 6617 8927 2247 9498 497 8265 7685 5543 5543 ...
##  $ CurrentCreditLines                 : int  5 14 NA 5 19 21 10 6 17 17 ...
##  $ OpenCreditLines                    : int  4 14 NA 5 19 17 7 6 16 16 ...
##  $ TotalCreditLinespast7years         : int  12 29 3 29 49 49 20 10 32 32 ...
##  $ OpenRevolvingAccounts              : int  1 13 0 7 6 13 6 5 12 12 ...
##  $ OpenRevolvingMonthlyPayment        : num  24 389 0 115 220 1410 214 101 219 219 ...
##  $ InquiriesLast6Months               : int  3 3 0 0 1 0 0 3 1 1 ...
##  $ TotalInquiries                     : num  3 5 1 1 9 2 0 16 6 6 ...
##  $ CurrentDelinquencies               : int  2 0 1 4 0 0 0 0 0 0 ...
##  $ AmountDelinquent                   : num  472 0 NA 10056 0 ...
##  $ DelinquenciesLast7Years            : int  4 0 0 14 0 0 0 0 0 0 ...
##  $ PublicRecordsLast10Years           : int  0 1 0 0 0 0 0 1 0 0 ...
##  $ PublicRecordsLast12Months          : int  0 0 NA 0 0 0 0 0 0 0 ...
##  $ RevolvingCreditBalance             : num  0 3989 NA 1444 6193 ...
##  $ BankcardUtilization                : num  0 0.21 NA 0.04 0.81 0.39 0.72 0.13 0.11 0.11 ...
##  $ AvailableBankcardCredit            : num  1500 10266 NA 30754 695 ...
##  $ TotalTrades                        : num  11 29 NA 26 39 47 16 10 29 29 ...
##  $ TradesNeverDelinquent..percentage. : num  0.81 1 NA 0.76 0.95 1 0.68 0.8 1 1 ...
##  $ TradesOpenedLast6Months            : num  0 2 NA 0 2 0 0 0 1 1 ...
##  $ DebtToIncomeRatio                  : num  0.17 0.18 0.06 0.15 0.26 0.36 0.27 0.24 0.25 0.25 ...
##  $ IncomeRange                        : Factor w/ 8 levels "$0","$1-24,999",..: 4 5 7 4 3 3 4 4 4 4 ...
##  $ IncomeVerifiable                   : Factor w/ 2 levels "False","True": 2 2 2 2 2 2 2 2 2 2 ...
##  $ StatedMonthlyIncome                : num  3083 6125 2083 2875 9583 ...
##  $ LoanKey                            : Factor w/ 113066 levels "00003683605746079487FF7",..: 100337 69837 46303 70776 71387 86505 91250 5425 908 908 ...
##  $ TotalProsperLoans                  : int  NA NA NA NA 1 NA NA NA NA NA ...
##  $ TotalProsperPaymentsBilled         : int  NA NA NA NA 11 NA NA NA NA NA ...
##  $ OnTimeProsperPayments              : int  NA NA NA NA 11 NA NA NA NA NA ...
##  $ ProsperPaymentsLessThanOneMonthLate: int  NA NA NA NA 0 NA NA NA NA NA ...
##  $ ProsperPaymentsOneMonthPlusLate    : int  NA NA NA NA 0 NA NA NA NA NA ...
##  $ ProsperPrincipalBorrowed           : num  NA NA NA NA 11000 NA NA NA NA NA ...
##  $ ProsperPrincipalOutstanding        : num  NA NA NA NA 9948 ...
##  $ ScorexChangeAtTimeOfListing        : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ LoanCurrentDaysDelinquent          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ LoanFirstDefaultedCycleNumber      : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ LoanMonthsSinceOrigination         : int  78 0 86 16 6 3 11 10 3 3 ...
##  $ LoanNumber                         : int  19141 134815 6466 77296 102670 123257 88353 90051 121268 121268 ...
##  $ LoanOriginalAmount                 : int  9425 10000 3001 10000 15000 15000 3000 10000 10000 10000 ...
##  $ LoanOriginationDate                : Factor w/ 1873 levels "2005-11-15 00:00:00",..: 426 1866 260 1535 1757 1821 1649 1666 1813 1813 ...
##  $ LoanOriginationQuarter             : Factor w/ 33 levels "Q1 2006","Q1 2007",..: 18 8 2 32 24 33 16 16 33 33 ...
##  $ MemberKey                          : Factor w/ 90831 levels "00003397697413387CAF966",..: 11071 10302 33781 54939 19465 48037 60448 40951 26129 26129 ...
##  $ MonthlyLoanPayment                 : num  330 319 123 321 564 ...
##  $ LP_CustomerPayments                : num  11396 0 4187 5143 2820 ...
##  $ LP_CustomerPrincipalPayments       : num  9425 0 3001 4091 1563 ...
##  $ LP_InterestandFees                 : num  1971 0 1186 1052 1257 ...
##  $ LP_ServiceFees                     : num  -133.2 0 -24.2 -108 -60.3 ...
##  $ LP_CollectionFees                  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ LP_GrossPrincipalLoss              : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ LP_NetPrincipalLoss                : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ LP_NonPrincipalRecoverypayments    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ PercentFunded                      : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ Recommendations                    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ InvestmentFromFriendsCount         : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ InvestmentFromFriendsAmount        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Investors                          : int  258 1 41 158 20 1 1 1 1 1 ...


Loading Libraries and Setting the Theme


The following libraries are used:


library(ggplot2)
library(knitr)
library(ggthemes)
library(gridExtra)
library(dplyr)
library(lubridate)
library(tidyr)


theme_set(theme_economist())


I chose my theme from the ggthemes Github repo


Exploring and Cleaning Variables


# Convert CreditScoreRangeLower and CreditScoreRangeUpper into a single
# CreditScore value (the average of the two)
loans <- loans %>%
         mutate(CreditScore = CreditScoreRangeLower / 2 +
                              CreditScoreRangeUpper / 2)


The variable ‘ListingCategory’ shows the reason a loan was taken, such as for debt consolidation or maybe a student loan. Unfortunately, the data is stored numerically, with the key being listed here. It makes much more sense to see the listing as what they are, rather than the arbitrary number it’s been given. For that, I’ll have to change the ‘ListingCategory’ variable from numeric to a factor:


# Convert ListingCategory from numeric to factor variable using the keys given
# in the Google Spreadsheet.
labels <- c("Debt Consolidation", "Home Improvement", "Business", "Personal Loan", "Student Use", "Auto", "Baby & Adoption", "Boat", "Cosmetic Procedure", "Engagement Ring", "Green Loans", "Household Expenses", "Large Purchases", "Medical/Dental", "Motorcycle", "RV", "Taxes", "Vacation", "Wedding", "Other", "Not Applicable")

loans$ListingCategory <- factor(loans$ListingCategory..numeric.,
                                          levels = c(1:6, 8:20, 7, 0),
                                          labels = labels)

summary(loans$ListingCategory)
## Debt Consolidation   Home Improvement           Business 
##              58308               7433               7189 
##      Personal Loan        Student Use               Auto 
##               2395                756               2572 
##    Baby & Adoption               Boat Cosmetic Procedure 
##                199                 85                 91 
##    Engagement Ring        Green Loans Household Expenses 
##                217                 59               1996 
##    Large Purchases     Medical/Dental         Motorcycle 
##                876               1522                304 
##                 RV              Taxes           Vacation 
##                 52                885                768 
##            Wedding              Other     Not Applicable 
##                771              10494              16965


# Convert dates to date class using lubridate's ymd_hms() function

x <- as.character(loans$LoanOriginationDate)
loans$LoanOriginationDate <- ymd_hms(x)

# Convert LoanOriginationQuarter to begin with the year using tidyr
# This also makes sure that any plot axis will put it in increasing order
# of year

loans$LoanOriginationQuarter <- as.character(loans$LoanOriginationQuarter)
loans <- loans %>%
         separate (col = LoanOriginationQuarter,
                   into = c("Quarters", "Year"), sep = " ") %>%
         unite(col = LoanOriginationQuarter, Year, Quarters, sep = " ")

loans$LoanOriginationQuarterF <- factor(loans$LoanOriginationQuarter)


# Join the CreditGrade (which stores credit rating pre-2009) and 
# ProsperRating..Alpha. (which stores credit rating post-2009) to make one
# variable. They aren't exactly the same thing, but they're reasonably close
# and it makes plotting time series data easier.

rating_categories <- rev(c("HR", "E", "D", "C", "B", "A", "AA"))

loans <- loans %>%
  mutate(CreditRating = ifelse(ProsperRating..Alpha. %in% rating_categories,
                               as.character(ProsperRating..Alpha.),
                               ifelse(CreditGrade %in% rating_categories,
                                      as.character(CreditGrade),
                                      NA)))

loans$CreditRating <- ordered(loans$CreditRating, labels = rating_categories)

str(loans$CreditRating)
##  Ord.factor w/ 7 levels "AA"<"A"<"B"<"C"<..: 4 1 7 1 5 3 6 4 2 2 ...


Univariate Plots


Since I’m dealing with loan data, the first thing I want to check is the loan amounts being requested. Let’s make a simple histogram showing just that:


qplot(x = LoanOriginalAmount, data = loans,
      binwidth = 1000, colour = I("#424242"), fill = I("#0077C0")) +
  scale_x_continuous(breaks = seq(0, 35000, 5000))


After adjusting the bin width and x-axis breaks, we get a nice histogram showing the loans taken ranging from $1000 to the maximum amount of $35,000. The bulk of the count lies below $15,000, which indicates that most of the borrowers using Prosper are looking for small(er) loans. The other thing I noticed was that there are huge spikes in the count on nice, round values, like $10,000, $15,000, and even $20,000 and $25,000. This makes sense; people are likelier to fixate on values that are easy to remember, like those that are factors of 5 and 10, rather than arbitrary values in between.

Next I want to visualize more data on these borrowers, and there are a variety of variables that can paint a clearer picture. Let’s see what they do, and where they’re from:


qplot(x = Occupation, data = loans, fill = I("#0077C0")) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))


I first ran the above code without adjusting the text angle, but since there are 68 different occupations listed, the words got all muddled together. So I decided to opt for a vertical approach.

The first thing I noticed in this histogram were the two bars a world away from the rest of the occupations - and a quick check told me that it was the count for ‘Other’ and ‘Professional’. It seems like the people signing up on Prosper are not quite willing to provide their job information, and are opting instead for these two ambiguous options. Most of the professions have pretty low counts, but the diversity is quite interesting. Among all of them, relatively popular ones include analysts, accountants, computer programmers, teachers and executives. Prosper has definitely enabled a wide range of working people to either borrow or invest money.


qplot(x = BorrowerState, data = loans, fill = I("#0077C0")) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))


Once again I changed the text angle so that they weren’t illegibly squished together. California is by far the biggest, and that was expected because it is (i) the state where Prosper was founded and (ii) the state with one of the highest state debt per capita according to this neat interactive data visualization (and also this one). The other popular states include Florida, New York, Texas and Illinois.

Next let’s look at some financial information:


qplot(x = IncomeRange, data = loans, fill = I("#0077C0"))


That’s a fantastic looking graph, almost perfectly normally distributed! Unfortunately, it’s not completely right. The x-axis has an arbitrary order when it should have a logical ascending order. Just needs a small adjustment…


qplot(x = IncomeRange, data = loans, fill = I("#0077C0")) +
  scale_x_discrete(limits = c("Not employed", "$0", "$1-24,999", "$25,000-49,999", "$50,000-74,999", "$75,000-99,999", "$100,000+", "Not displayed"))


…and although this doesn’t look quite as good, it’s much more coherent. The bulk of the data lies in that lower to middle-income region, whom I presume require services such as Prosper’s the most.


CreditScoreCount <- qplot(x = CreditScore, data = loans, binwidth = 20,
      color = I("#424242"), fill = I("#0077C0")) +
  scale_x_continuous(limits = c(400, 900), breaks = seq(400, 900, 50))

CreditScoreDensity <- ggplot(aes(x = CreditScore), data = loans) +
  geom_histogram(aes(y = ..density..),
                 binwidth = 20,
                 color = I("#424242"), fill = I("#0077C0")) +
  geom_density(adjust = 4, alpha = 0.3, fill = I("#DE703B")) +
  geom_vline(aes(xintercept = mean(CreditScore, na.rm = T)),
             color = I("#F7E74A"), linetype = "dashed", size = 1) +
  scale_x_continuous(limits = c(400, 900), breaks = seq(400, 900, 50))

grid.arrange(CreditScoreCount, CreditScoreDensity, ncol = 2)


The first time I ran the basic code, I got an outlier credit score around the 0 mark. Checking the summary, the results indicated that the minimum was a 9.5, with the 1st quarter at 669.5 and the median at 689.5. I then added limits in the code to get rid of the outlier, and after playing around with the bin width this was the most comprehensible histogram. The plot clearly shows that a majority of the users lie between the 650 and 750 mark, which are decent credit scores and consistent with the previous plots, especially the loan amounts received.

The plot also looked approximately normally distrubuted, so I decided to put a kernel density estimate overlay, just to see how that would look. What I got was the graph on the right, with the yellow dashed line displaying the mean Credit Score. The side on the right of the mean looks a lot more smoother than the side on the left, and I presume that may be because of the rule implemented by Prosper in 2009 - that the minimum Credit Score required for a loan would be 640.


qplot(x = ListingCategory, data = loans, 
      fill = I("#0077C0")) +
  scale_y_continuous(limits = c(0, 60000), breaks = seq(0, 60000, 10000)) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))


Debt Consolidation seems by and far the most popular choice, with the rest of the non-ambiguous (or ’Not Applicable) occupations much below the 10,000 mark.


Time Series Plots


Time Series plots are incredibly useful in seeing the performance of Prosper since it launched, and pick up on any interesting trends or particular disappointing declines. Since providing loans is Prosper’s business, the first thing to check would be how many loans they’ve coordinated over time (i.e. from 4th quarter 2005 to 1st quarter 2014).

Before I start plotting, however, I’m going to make a new dataset that organizes the information I need. Using the ‘tidyr’ package, I’m going to rename some variables and structure some values to be more conceivable.


loans.origination <- loans %>% 
  select(Quarter = LoanOriginationQuarter,
         Amount = LoanOriginalAmount) %>%
  group_by(Quarter) %>%
  summarise(Loans = n() / 1000,
            Dollars = sum(Amount) / 1000000) %>%
  arrange(Quarter)


Now that that’s done, I’m going to plot loan originations by quarter, starting with the last one in 2005 all the way to the first one in 2014.


ggplot(aes(x = Quarter, y = Loans), data = loans.origination) +
  geom_bar(stat = "identity", fill = I("#0077C0")) +
  geom_text(aes(label = round(Loans, 1)), vjust = -0.5, size = 4) +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5)) +
  ylab("Loans (Thousands)")


This is a very interesting plot, almost like watching the company grow (and seeing some pitfalls along the way).

We can see the steady growth in number of loans from around 300 in 2006 Q1 to about 3100 two years later (2008 Q1) - a tenfold increase! But there’s a huge slump in the end of 2008, and which continues for the next 3 quarters (there’s no data for 2009 Q1).

A little looking up, and it turns out there was a significant event that caused this abrupt decline. In October 2008, Prosper filed to create a secondary marketplace - a place where lenders could resell or cash out their loans before its expiration - with the SEC. Doing so meant that Prosper had to enter a ‘quiet period’, where they had to stop brokering new loans (existing ones were still being serviced) until their registration with the SEC was complete.

The ‘quiet period’ ended in July 2009, which explains the 0 loans issued in 2009 Q2 (and I assume Q1 as well; there’s no data available), and the resurgance with around 600 loans in Q3.

There were a few other changes in Prosper’s financial model that they implemented during this period (either because of the SEC or simply as a means of improving). Pre-2009, Prosper operated like an eBay-like auction marketplace, where lenders and borrowers determined loan rates using a Dutch auction-style system. Soon after the SEC registration, Prosper changed their business model to using pre-set rates determined by Prosper based on a loan pricing algorithm that evaluated prospective borrower’s credit risk (Prosper Rating) and set the loan rate accordingly.

Post 2009, we can see the dramatic rise in the number of loans that were brokered, when Prosper went from around 1200 in the beginning of 2010 to 14,400 in just a mere 4 years. They were plagued by a lot of defaults in their early years, however, leading to lowering investor confidence, so it will be interesting to see if these changes have improved the success rate of loan completions in their marketplace. We’ll explore this over the course of this EDA (and some other stuff).


ggplot(aes(x = Quarter, y = Dollars), data = loans.origination) +
  geom_bar(stat = "identity", fill = I("#2EB872")) +
  geom_text(aes(label = round(Dollars, 0)), vjust = -0.5, size = 4) +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5)) +
  ylab("Dollars Loaned (Millions)")


This plot is the same as the previous one, but with Dollar amounts instead of number of loans. The purpose of this is just to get a bit more specific, and see approximately how many dollars (in millions) are being loaned out each quarter.

Now both plots clearly show the increasing quantity of loans (and amount loaned) being organized by Prosper, which is an indicator of a growing business. However, when analyzing such services, we must also look at the quality of the loans. Increase in growth of loan originations is not constructive if most of the loans are being defaulted or charged off. So that’s what I’m going to create in the next plot: loan originations coloured by the status of the loan.

Let’s take a look at the summary of the ‘LoanStatus’ variable to see what the grouping/colouring will look like:


summary(loans$LoanStatus)
##              Cancelled             Chargedoff              Completed 
##                      5                  11992                  38074 
##                Current              Defaulted FinalPaymentInProgress 
##                  56576                   5018                    205 
##   Past Due (>120 days)   Past Due (1-15 days)  Past Due (16-30 days) 
##                     16                    806                    265 
##  Past Due (31-60 days)  Past Due (61-90 days) Past Due (91-120 days) 
##                    363                    313                    304


Too many variables - there will be too many colours to make an effective visualization. I see that there are 6 ‘Past Due’ factors based on how late the borrowers are with their payments, but it will be much more useful visually if they’re grouped as one. Let’s write the code to do that:


loans <- loans %>%
  mutate(LoanStatusGroup = ifelse(LoanStatus %in% 
                                  c("Cancelled", "Chargedoff", "Defaulted"), 0,
                           ifelse(LoanStatus %in%
                                  c("Current", "FinalPaymentInProgress"), 2,
                           ifelse(LoanStatus %in%
                                  c("Completed"), 3,
                                  1))))

loans$LoanStatusGroup <- factor(loans$LoanStatusGroup, levels = 0:3,
                                labels = c("Defaulted", "Past Due", 
                                           "Current", "Completed"))

loans.defaults <- loans %>% 
  group_by(Quarter = LoanOriginationQuarter, LoanStatusGroup) %>%
  summarise(Loans = n() / 1000) %>%
  arrange(Quarter, LoanStatusGroup)


ggplot(aes(x = Quarter, y = Loans, fill = LoanStatusGroup), 
       data = loans.defaults) +
  geom_bar(stat = "identity") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5)) +
  ylab("Loans (Thousands)") +
  scale_fill_manual(values = c("#FA4659","#FBB448", "#0C9CEE", "#2EB872"))


Now this plot provides us a lot of information. Not only do we see the growth of the company through it’s increasing loan openings, but we can also see the performance of loans over time. I’ve specifically chosen the colours for the plot - the red to yellow to green denote a progression from worse to better, and the light blue represents ongoing loans that will either end up being completed or defaulted in the future.

I’ve provided a more thorough analysis of this plot in the Final Plots section in the end of the report.


Bivariate Plots


ggplot(aes(x = ProsperRating..Alpha., y = CreditScore,
           fill = I("#F5F5F5"), color = I("#506E86")), 
       data = loans) +
  geom_boxplot(lwd = 0.75, outlier.color = I("#FBB448"), outlier.shape = 1) +
  scale_x_discrete(limits = c("HR", "E", "D", "C", "B", "A", "AA")) +
  coord_cartesian(ylim = c(600, 950))


The boxplots above shows the relationship between borrower’s Prosper rating (note - this is only post-2009 data) and their credit score, and the variation in each rating category. A person’s credit score is one of the key factors in determining their Prosper Rating, so it’s no surprise that as we climb the rating categories, the credit score of the borrowers also tend to increase.

I see that ‘HR’ has a slightly higher median and IQR than ‘E’ despite being a riskier category - in fact, it’s on par with borrowers with a ‘D’ rating. Let’s take a quick look at their summaries to check the details:


by(loans$CreditScore, loans$ProsperRating..Alpha., summary)
## loans$ProsperRating..Alpha.: 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##     9.5   609.5   649.5   653.9   709.5   889.5     591 
## -------------------------------------------------------- 
## loans$ProsperRating..Alpha.: A
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   649.5   709.5   729.5   739.4   769.5   889.5 
## -------------------------------------------------------- 
## loans$ProsperRating..Alpha.: AA
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   689.5   749.5   789.5   783.6   809.5   889.5 
## -------------------------------------------------------- 
## loans$ProsperRating..Alpha.: B
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   609.5   689.5   709.5   716.4   749.5   869.5 
## -------------------------------------------------------- 
## loans$ProsperRating..Alpha.: C
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   609.5   669.5   689.5   699.4   729.5   889.5 
## -------------------------------------------------------- 
## loans$ProsperRating..Alpha.: D
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   609.5   669.5   689.5   689.8   709.5   869.5 
## -------------------------------------------------------- 
## loans$ProsperRating..Alpha.: E
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   609.5   649.5   669.5   672.0   689.5   869.5 
## -------------------------------------------------------- 
## loans$ProsperRating..Alpha.: HR
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   609.5   669.5   689.5   686.5   709.5   869.5


Now we can see that the medians of ‘HR’ and ‘D’ are the same, and the mean differs by just 3.3 to ‘D’. The median of ‘E’, on the other hand, is sandwiched between these two categories with a credit score median of 20 below.

Credit Score is one of the factors of Prosper’s rating categories, and these categories, along with some other financial information, determine the Annual Percentage Rate that a will apply to the borrower’s loan. So let’s see how the APR changes as the categories get less risky:


ggplot(aes(x = ProsperRating..Alpha., y = BorrowerAPR * 100,
           fill = I("#F5F5F5"), color = I("#506E86")),
      data = subset(loans, !is.na(BorrowerAPR))) +
  geom_boxplot(lwd = 0.75, outlier.colour = I("#FBB448"), outlier.shape = 1) + 
  scale_x_discrete(limits = c("HR", "E", "D", "C", "B", "A", "AA")) +
  ylab("Borrower APR")


The boxplots above show the relationship between borrower’s Prosper rating and their assigned Annual Percentage Rate (APR). It’s very clear that as we go down the ladder of risk - from a ‘High Risk’ to an ‘AA’ rating - the APR for the borrower reduces drastically. In fact, looking at the results of a by() function, it goes from a median APR of 35.8% for High Risk all the way to a median value of 9% for ‘AA’.

The variation in APRs also decreases as the loans get less riskier as displayed by the decreasing size of the boxes in the boxplots when going from ‘HR’ to ‘AA’. There is also a reduction in the number of outliers, which is visible by the shortening lines of yellow rings.

Next, let’s look at some line graphs:


ggplot(aes(x = Quarter, y = Loans * 1000, group = LoanStatusGroup), 
       data = filter(loans.defaults, as.numeric(LoanStatusGroup) < 2)) +
  geom_line(size = 0.9) +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5)) +
  scale_y_continuous(breaks = seq(0, 1500, 100)) +
  ylab("Defualted Loans")


The line graph above shows the number of loans that were defaulted over time. This is important for Prosper because they can see how frequently bad loans are made, and more importantly, to judge whether any policies - like the minimum credit score - are improving the likelihood of payment.

The 2 times the line veers below the 200 mark are misleading. The first is because of the ‘quiet’ period mentioned before, and therefore expected. The second one, however, in 2013, is because most of the loans in that period are with the ‘Current’ or ‘FinalPaymentInProgress’ status, and there just hasn’t been enough time to know whether loans are ‘Completed’ or ‘Defaulted’. Over time, that line should go higher.

In fact, a better way to show defaulted loans would be to show the rate at which loans are being defaulted. Let’s do that next:


loans.defaultrate <- loans %>%
  filter(LoanStatusGroup != "Current" & LoanStatusGroup != "Past Due") %>%
  group_by(Quarter = LoanOriginationQuarterF, LoanStatusGroup) %>%
  summarise(Loans = n()) %>%
  mutate(Rate = Loans / sum(Loans))

ggplot(aes(x = Quarter, y = Rate * 100, group = LoanStatusGroup), 
       data = subset(loans.defaultrate, LoanStatusGroup == "Defaulted")) +
  geom_line(size = 0.9) +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5)) +
  stat_smooth(method = "lm", size = 0.7) +
  ylab("Default Percentage")


This looks more systematic. I can see that between 2006 and 2008, the default rate hovered between 30%-40% - a pretty high rate. Once again, we see that drop, but it does not go to zero this time, and we can now see that during that period (which also happens to be after the financial crisis) there were borrowers unable to pay back their loans.

This is going to be my second final plot, and I’ll discuss it in more detail in the Final Plots section.

Next, I’m going to split that default rate by the rating that Prosper has given it, to see how well or how badly each rating category is doing in terms of defaulting.


loans.defaultratecolor <- loans %>%
  filter(LoanStatusGroup != "Current" & 
         LoanStatusGroup != "Past Due") %>%
  group_by(Quarter = LoanOriginationQuarter, LoanStatusGroup,
           CreditRating) %>%
  summarise(Loans = n()) %>%
  mutate(Rate = Loans / sum(Loans))

ggplot(aes(x = Quarter, y = Rate * 100, group = CreditRating,
           color = CreditRating), 
       data = subset(loans.defaultratecolor, 
                     LoanStatusGroup == "Defaulted",
                     na.rm = T)) +
  geom_line(size = 0.9) +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5)) +
  ylab("Default Percentage")


This looks like quite an overcrowded plot but there is a lot of useful information inside. For example, I notice that a loan with an ‘A’ rating has always performed better than an ‘AA’ rated one, despite being a slightly ‘riskier’ category. Also, between late 2007 and early 2010, ‘HR’ and ‘E’ loans performed better almost always than Ds, Es, and Cs, and sometimes even Bs. It’s quite interesting because this was the period during the financial crisis, and while it’s erroneous to say that they were ‘better off’, it does seem like they were more likely to repay loans than their less-riskier counterparts.

There are two huge spikes in terms of time period, one in 2009 Q2 and one in 2014 Q1. It’s actually surprising because ‘AA’ rated loans jumped up to close to 40%, and ‘C’ spiked past 60%, but it’s most likely because there were very few number of loans originating from that period, and the huge numbers are likely because of the small size (for example if there were only 2 ‘AA’ loans in the period, and 1 defaulted and 1 completed). The same goes for the 2014 phenomenon, most loans are ongoing, and the default rate is inconclusive if the sample size is very small.


ggplot(aes(x = DelinquenciesLast7Years, y = AmountDelinquent),
       data = filter(loans, AmountDelinquent > 0 & 
                            EmploymentStatus != "Other" )) +
  geom_point() +
  xlim(0, quantile(loans$DelinquenciesLast7Years, 0.99, na.rm = T)) +
  ylim(0, quantile(loans$AmountDelinquent, 0.99, na.rm = T)) +
  facet_wrap(~EmploymentStatus)


This plot doesn’t reveal much. I wanted to see the relationship between the amount borrowers were delinquent and the number of delinquencies they’ve had over the last 7 years. I then separated that by employment status to see if people that weren’t full employed had higher delinquencies or owed more money. The overplotting and general dispersion of data doesn’t really reveal the trend I hypothesized.


Multivariate Plots


loans.status <- loans %>%
  filter(LoanStatus %in% c("Completed", "Defaulted")) %>%
  select(DebtToIncomeRatio, CreditScore, LoanStatus, 
         LoanOriginationQuarter)

ggplot(aes(x = DebtToIncomeRatio, y = CreditScore,
           alpha = 0.05, color = LoanStatus),
       data = loans.status) +
  geom_point() +
  coord_cartesian(ylim = c(400, 900)) +
  scale_color_manual(values = c("#2EB872", "#FA4659"))


In this scatter plot, I notice that there is a lot of overplotting at the low end of the debt to income ratio, and a disctint line of points at a ratio of 10, with a sparse no-man’s land in between. The plot isn’t very informational. I notice that the defaulters tend to have lower credit scores - around 600 and below - displayed by the sea of red. Past 600 it seems mostly green, but once again it’s awfully overplotted and since I don’t see any other trends, I won’t take this any further.


ggplot(aes(x = DebtToIncomeRatio, y = BorrowerAPR,
           color = as.integer(CreditRating), group = CreditRating), 
       data = loans) +
  geom_jitter() +
  xlim(0, quantile(loans$DebtToIncomeRatio, 0.995, na.rm = T)) +
  scale_color_gradient2(high = "#FA4659", midpoint = 4, mid = "#FBB448",
                       low = "#2EB872",
                       breaks = seq(1, 7, 1),
                       labels = rating_categories,
                       limits = c(1, 7),
                       name = 'Category of Risk') +
  theme(legend.position = "right", 
        legend.direction = "vertical",
        legend.key.size = unit(1, "cm"))


This is a great plot with a lot of information. Here we have a scatter plot of borrower’s APR and the debt to income ratio of the borrower, with the colors describing the risk category given to the particular loan. I’ve given the legend a continuous color scale despite it being discrete variables because it displays the progression from a safe green to a risky red. I’ve also decided to include all points in the y-axis (including outliers) to show the range of rates, and I’ve limited the x-axis by removing 0.05% of the points furthest away from the median (i.e. removing outliers that spread the graph).

The first thing I notive and find interesting is that ‘A’ category loans seem to have a lower APRs and a smaller range of debt-to-income ratios, both of which indicate less risk. The rest of the plot follow the color palette and APR increases as the rating gets riskier. Another thing is that most people tend to have debt-to-income ratios below 1, regardless of risk category. Also, there is this unusual horizontal line in the ‘HR’ category that extends past 1 and all the way to 1.5, while lower ratings tend to be sparse in the 1.0+ debt-to-income ratio range.


ggplot(aes(x = LoanOriginalAmount, y = LenderYield,
           color = as.integer(CreditRating), group = CreditRating),
       data = loans) +
  geom_jitter() +
  scale_color_gradient2(high = "#FA4659", midpoint = 4, mid = "#FBB448",
                       low = "#2EB872",
                       breaks = seq(1, 7, 1),
                       labels = rating_categories,
                       limits = c(1, 7),
                       name = 'Category of Risk') +
  facet_wrap(~LoanStatusGroup, ncol = 2) +
  theme(legend.key.width = unit(1, "cm"))


This plot shows the relationship between a lender yield on the loan and the amount that a borrower has loaned. I then made individual graphs to show that relationsip based on the status of the loan - Defaulted, Past Due, Current and Completed - and finally coloured it based on risk rating.

The final result is a very interesting plot, and I will talk more about it in the Final Plots section.


Default_APR <- ggplot(aes(x = CurrentDelinquencies, y = BorrowerAPR,
           color = LoanStatusGroup),
       data = filter(loans, loans$CurrentDelinquencies > 0 &
                            loans$LoanStatusGroup == "Defaulted" &
                            !is.na(CreditRating))) +
  geom_jitter(alpha = 1/2) +
  xlim(0, quantile(loans$CurrentDelinquencies, 0.995, na.rm = T)) +
  facet_wrap(~CreditRating, ncol = 7) +
  scale_color_manual(values = c("#FA4659", "#2EB872"))

Complete_APR <- ggplot(aes(x = CurrentDelinquencies, y = BorrowerAPR,
           color = LoanStatusGroup),
       data = filter(loans, loans$CurrentDelinquencies > 0 &
                            loans$LoanStatusGroup == "Completed" &
                            !is.na(CreditRating))) +
  geom_jitter(alpha = 1/2) +
  xlim(0, quantile(loans$CurrentDelinquencies, 0.99, na.rm = T)) +
  facet_wrap(~CreditRating, ncol = 7) +
  scale_color_manual(values = "#2EB872")

grid.arrange(Complete_APR, Default_APR, ncol = 1)


This plot was made to see if there were any distinct differences in terms of completing and defaulting loans when it came to current delinquencies. Unfortunately, there doesn’t seem to be any tell-tale signs and both plots look pretty similar. However, I do notice that higher rated loans seem less diverse in terms of delinquencies and APR, and customarily lumped in the bottom left corner. As the loan gets riskier, the points get more varied and diverse, and tend to be all over the graph.


Final Plots


Plot One



I have chosen this plot because of it’s combination of detail and simplicity. It makes for an easy way to evaluate the performance of Prosper laons. I’m going to compare it as pre-2009 and post-2009, because 2009 was when they went into a ‘quiet’ period and changed their business model and also mandated a minimum credit score of 640. This plot is one way of visually seeing whether their changes have resulted in a more prosperous lending platform (sorry!).

First let’s look at pre-2009. They were still a young company at the time, and we can see that with the sub-5000 loans per quarter figure. More importantly, though, all the loans originated at the time are either completed or defaulted (i.e. none are still ongoing). Now I can compare the relative sizes of the red and green bars, and I can easily tell that approximately half, or a bit less that half the loans that were granted, defaulted.

That’s not good, especially when they have to convince investors that they’re making solid investments. Now let’s look at post-2009. Right when they restarted servicing loans, for about the next year, we can see that the size of the red bar is much smaller relative to the green bar. That tells me that their minimum credit policy seems like it’s working - defaults look pretty low in number.

I chose to look at only the next year because there seem to be no, or an insignificant amount of loans still currently active. After that - 2011 onwards - we see the number of loans being originated rise tremendously. The red bar to green bar ratio seems to be increasing slowly, but it’s difficult to tell with the growing ‘Current’ blue bar in between. But a few questions arise - how many of those current loans will end up in the green, completed group in the future? How many will instead dive into the yellow of ‘Past Due’? And out of those, how many will make it out and enter green, and how many will fall into the dreadful red of ‘Defaulted’?

It’s difficult to say whether their new policies have improved investment quality - the first year certainly suggests so, but scaling up quickly invariably leads to some new problems. Only time can tell what color the blues in the plot will convert to.

Plot Two



I’ve chosen this plot because it’s clear-cut and delivers its message precisely. The graph shows the default rate (in percentage) of loans over the years. In a way, it continues from the Plot One above, acuurately showing the default rates instead of approximating from a coloured bar. And I can validate some of the estimations I made earlier.

Pre-2009, we can see that the rate generally hung around the 30-40% mark, which is considerably high and a definite area for improvement. 2008 Quarter 4 was when Prosper stopped making new loans, which lasted uptil 2009 Quarter 3. That might be the reason for the steep drop during that period. However, the interesting thing is that they continued performing at that low default rate for the entirety of 2010, when they restarted their service with new policies.

During 2011 it went back up to around 30% and stayed in that region up until 2012. It is important to point out that 2011 onwards, there are stil loans that are still currently running, and we cannot make conclusions based on the data. Especially 2013 onwards, where most of the loans are still ongoing, and very few loans are either completed, defaulted, or past due.

Hence, that downward spike occuring around the end of 2012 uptil 2014 is quite misleading. However, we can say that default rates for complete data (pre-2011) has improved, as they are no longer touching 40%. The hovering around 30% is still quite unfavourable, though, and I’m sure Prosper would like to see that number drop over the coming years.


Plot Three



I have chosen this plot because of the vast range of questions it can answer. I can immediately see that the defaulted category doesn’t have any loans more than about $25,000. In fact, even in the Past Due and Completed sections, there aren’t a lot of loans above around the $25,000 mark. In the Current category, however, I see much more loans being taken past the $25,000 mark and even past the $30,000 mark and veering towards the maximum of $35k.

With the colouring, I see that the Defaulted and Completed categories look quite haphazard. In fact for defaulted loans, the bold red of High Risk is plastered all over the yield range, which is contrasting to the other 3 categories where the yield tends to be around 35-40%.

I personally like the Current category because of the neat, ordered rows of colours. One thing that I had pointed out before, and is very apparent here, is that the ‘A’ rating seems to get lower yields. I also see that borrowers with a riskier rating tend to go for lower loan amounts, which makes sense because they’re afraid of not being able to pay them off should any unfortunate circumstances take place, like losing a job, say. Also, it’s unlikelier for their loan listing will be accepted should they choose a high amount.

This plot always a direct comparison between Defaulted and Completed loans that we can compare. For example, I see that ‘A’ rated loans seem scarce compared to ‘AA’ rated loans in the Defaulted category, wheras in the Completed category they seem more or less even. I also see that there are very little ‘HR’ rated loans in the Current category - could it be because of Prosper’s new minimum credit score? The same amount seems to be in the Past Due section, and from Plot One we can see that there are very few total loans in that category.

I’ve just mentioned a few questions that can be answered through this plot, and I’m sure there are answers to questions I haven’t even thought of. However, I think it’s appropriate to comment on the limitations of the plot - the main one being that the overplotting often hides some of the information. For example, there could a few bold green AAs hiding amoung the yellow Cs - and that means we’re not getting the full, exact picture. Nonetheless, it is a fantastic indicator of trends, and its complexity is useful in answering questions the previous two plots could not.


Reflection


In a sentence: this was much more difficut than I thought it would be, but in the end it was incredibly rewarding. Before I started, I thought, “I’ve sort of done EDA before with P1: Exploring Titanic Dataset, how much harder can this be?” It wasn’t nearly the same; the dataset was much more sophisticated and the tools were much more refined.

When I realized this, I knew immediately that there had to be some purpose behind this EDA, some direction that I wanted to head in. I also realized that I needed to be more informed about my data and what exactly I was trying to show. So I first researched Prosper, peer-to-peer lending, loan details and financial procedures, among other things. I then planned on how I would attack the data, and concluded that there were three primary directions I could follow. I could explore it by questioning the data through the eyes of one of the three main stakeholders: borrowers, investors, and the company itself. I finally decided that I would explore it keeping in mind the requirements of Prosper - what information could I find that would be useful to improve Prosper, or identify a particular business flaw.

The main problem I ran into this dataset was overplotting. A lot of the time, I would hypothesize a relationship and then plot the data to check if it was true or not, but most of the time the sheer amount of data (and it’s variation) would disguise any trend that may exist. A solution would be to sample the data, but I chose not to do that this time (and opted instead for colouring it).

I also noticed that this project involved a LOT of googling to figure out how to do stuff. I also realized the amount of effort that’s required to get that amazing plot that simplifies complicated information. Just the sheer number of trials (and errors) I went through with this project makes me now really appreciate the wonderful visualizations I see around the Web. The combination of science and art is challenging, but quite fun, and I know that experience with datasets like this one will only improve my EDA skills.